We use chest X-ray (CXR) images in DICOM format to classify patients as having pneumonia or not. We then apply object-detection techniques to locate the cloudy (opaque) regions of the lungs and draw bounding boxes around the patches that indicate pneumonia.
from google.colab import drive
drive.mount('/content/drive')
# Import necessary packages
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from skimage import io
from matplotlib.patches import Rectangle
!pip install -q pydicom
import pydicom
import pylab
# Path to the training images dataset
project_path = '/content/drive/My Drive/DLCP/CAPSTONE/'
input_file = project_path + "rsna-pneumonia-detection-challenge.zip"
# Extract the project files into the current working directory
from zipfile import ZipFile
with ZipFile(input_file, 'r') as z:
z.extractall()
# Train and test image folders and .csv files are extracted into this directory
PATH = "./"
print(os.listdir(PATH))
# Use pandas to read the .csv files with class info and the mask labels for training
class_info_df = pd.read_csv(PATH+'stage_2_detailed_class_info.csv')
train_labels_df = pd.read_csv(PATH+'stage_2_train_labels.csv')
# Get shape of the data and take a quick look at a few lines
print('Detailed Class Info shape:', class_info_df.shape)
class_info_df.sample(5)
# Get shape of the data and take a quick look at a few lines
print('Training Labels shape:', train_labels_df.shape)
train_labels_df.sample(5)
Detailed Class Info
Training Labels
Identify data across categories
# Check the number of records for each class
print(class_info_df['class'].value_counts())
f, ax = plt.subplots(1, 1, figsize=(6, 4))
sns.countplot(x='class', data=class_info_df, order=class_info_df['class'].value_counts().index)
total = float(len(class_info_df))
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2.,
            height + 3,
            '{:1.2f}% ({:d})'.format(100 * height / total, int(height)),  # int() needed: get_height() returns a float
            ha="center")
plt.show()
# All classes have approximately the same number of records
# Check the number of records for each 'Target' value
target_counts = train_labels_df['Target'].value_counts()
print(target_counts)
sns.countplot(x='Target', data=train_labels_df)
# The target value of 0 has twice the number of records as that of 1.
# However, the target value of '0' includes 2 classes - 'No Lung Opacity / Not Normal' & 'Normal'
Identify Missing Data
# Define a function to print the percentage of missing values for each column
def count_missing_data(df):
    print('Percentage of missing values for each column:')
    for col in df.columns:
        print(col, ':', '{:.1%}'.format(df[col].isnull().mean()))
count_missing_data(class_info_df)
# There are no missing values in the class details df
# Check if there is missing data in any columns
count_missing_data(train_labels_df)
# 68.3% of the rows are missing bbox coordinates, i.e., the x, y, width and height columns
# Check if there are any columns with missing data when the target col is 1, that is when a patient is diagnosed with Pneumonia
count_missing_data(train_labels_df[(train_labels_df['Target']==1)])
# For patients diagnosed with pneumonia, none of the columns have missing data.
# All rows with missing data belong to patients without pneumonia (both 'Normal' and 'No Lung Opacity / Not Normal')
# Exactly 68.3% of rows are negative for pneumonia, so bounding boxes are missing only for patients not afflicted with pneumonia
# Before merging the two dfs, check if all the patientIds are present in both the class and labels datasets
df_diff = pd.concat([train_labels_df['patientId'],class_info_df['patientId']]).drop_duplicates(keep=False)
print('Number of patientIds in train_labels_df, but not in class_info_df:', len(df_diff))
df_diff = pd.concat([class_info_df['patientId'], train_labels_df['patientId']]).drop_duplicates(keep=False)
print('Number of patientIds in class_info_df, but not in train_labels_df:', len(df_diff))
# All patientIds are present in both the files.
There are no missing values in the Class Info data.
68.3% of the rows in the Training Labels data are missing bounding-box values. The missing data is for Target = 0, which includes 'Normal' & 'No Lung Opacity / Not Normal'.
In the train set, 31.61% of the rows have Target = 1.
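These proportions can be double-checked directly from the labels (a quick sanity-check sketch on the already-loaded train_labels_df):
# Sanity check: fraction of rows without a bbox vs. fraction of rows with Target=1
print('Missing bbox fraction: {:.2%}'.format(train_labels_df['x'].isnull().mean()))
print('Target=1 fraction: {:.2%}'.format((train_labels_df['Target'] == 1).mean()))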
Merge Train Labels and Class Info details
# Merge the two datasets, using patientId as the key
train_df = pd.merge(class_info_df, train_labels_df, on='patientId', how='inner')
train_df.drop_duplicates(inplace=True)
train_df.shape
# Plot the number of examinations for each class detected, grouped by Target value.
fig, ax = plt.subplots(nrows=1,figsize=(10,6))
ax = sns.countplot(x='Target', data=train_df, hue='class')
plt.title("Chest exams class and Target")
plt.ylabel("Exams")
All chest exams with Target=1 (pneumonia detected) are associated with class 'Lung Opacity'.
Target=0 (no pneumonia detected) includes the 'Normal' & 'No Lung Opacity / Not Normal' classes.
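A cross-tabulation of class against Target makes this mapping explicit (a quick sketch on the merged train_df):
# Confirm the class/Target mapping with a cross-tabulation
print(pd.crosstab(train_df['class'], train_df['Target']))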
Analyze the Lung Opacity Window
# For the class Lung Opacity, corresponding to values of Target = 1, we plot the density of x, y, width and height.
target1 = train_df[train_df['Target']==1]
sns.set_style('whitegrid')
fig, ax = plt.subplots(2, 2, figsize=(18, 8))  # subplots creates its own figure, so no extra plt.figure() is needed
sns.distplot(target1['x'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target1['y'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target1['width'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target1['height'],kde=True,bins=50, color="magenta", ax=ax[1,1])
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()
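The numeric counterparts of these densities can be read off with describe() (a minimal sketch):
# Summary statistics for the bounding-box coordinates of Target=1 rows
print(target1[['x', 'y', 'width', 'height']].describe())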
# Plot the centers of rectangles in brown superimposed with the actual bbox rectangle in yellow
fig, ax = plt.subplots(1,1,figsize=(7,7))
target_sample = target1.sample(2000)
target_sample['xc'] = target_sample['x'] + target_sample['width'] / 2 # xmin + width/2 gives the xcenter of the bbox
target_sample['yc'] = target_sample['y'] + target_sample['height'] / 2 # ymin + height/2 gives the ycenter of the bbox
plt.title("Centers of Lung Opacity rectangles (brown) over rectangles (yellow)\nSample size: 2000")
target_sample.plot.scatter(x='xc', y='yc', xlim=(0,1024), ylim=(0,1024), ax=ax, alpha=0.8, marker=".", color="brown")
for i, crt_sample in target_sample.iterrows():
ax.add_patch(Rectangle(xy=(crt_sample['x'], crt_sample['y']),
width=crt_sample['width'],height=crt_sample['height'],alpha=3.5e-3, color="yellow"))
plt.show()
Setup Paths for Training and Test Image Datasets
# Set the train and test images directories
train_images_dir = PATH+'stage_2_train_images/'
test_images_dir = PATH+'stage_2_test_images/'
print("Number of images in train set:", len(os.listdir(train_images_dir)),"\nNumber of images in test set:", len(os.listdir(test_images_dir)))
# Read a sample of datafiles
image_sample_path = os.listdir(train_images_dir)[:5]
print(image_sample_path)
print("Unique patientId in train_df:", train_df['patientId'].nunique()) # Check for duplicate records in training set
print("Total rows in train_df:", train_df.shape[0])
# Since the total rows are more than the number of patients, we can deduce that there are multiple rows for some patients
# Count rows per patient (patients with multiple boxes have multiple rows), then summarize by Target and class
tmp = train_df.groupby(['patientId','Target','class'])['patientId'].count()
df = pd.DataFrame(data={'Exams': tmp.values}, index=tmp.index).reset_index()
tmp = df.groupby(['Exams','Target','class']).count()
df2 = pd.DataFrame(data=tmp.values, index=tmp.index).reset_index()
df2.columns = ['Exams','Target','Class','Entries']
df2
sns.barplot(x=df2['Exams'], y=df2['Entries'], hue=df2['Class'])
Examine the Images
# Pick the first patient in train_df and build its DICOM filename
samplePatientID = train_df['patientId'].iloc[0] + '.dcm'
dicom_file_dataset = pydicom.dcmread(train_images_dir + samplePatientID)
dicom_file_dataset
# For 500 patients with pneumonia, get their age and gender from the DICOM headers
# (limited to 500 images here because reading every file takes a long time)
temp_df = train_df[train_df['Target']==1]
img_data = list(temp_df.T.to_dict().values())
# Collect patient info rows in a list, then build the dataframe in one go (faster than appending row by row)
stats_rows = []
# Using the pydicom package, access the dataset of each image
for i, data_row in enumerate(img_data[0:500]):
    img_file = data_row['patientId'] + '.dcm'
    img = pydicom.dcmread(train_images_dir + img_file)
    # Record the patient's id, age and gender
    stats_rows.append({'PatientId': img.PatientID, 'Age': int(img.PatientAge), 'Sex': img.PatientSex})
stats_df = pd.DataFrame(stats_rows, columns=['PatientId', 'Age', 'Sex'])
len(stats_df)
# Plot the frequency of patients with Pneumonia in each age group
plt.figure(1, figsize=(20,6))
sns.countplot(x='Age', data=stats_df)
# The number of pneumonia patients is largest between ages 40 and 60
# Plot count of patients with Pneumonia by gender
sns.countplot(x='Sex', data=stats_df)
# More males than females suffer from pneumonia
# Compare freq of patients with Pneumonia by gender and Age
plt.figure(1, figsize=(20,6))
sns.countplot(x='Age', hue='Sex', data=stats_df)
Plot DICOM Images with Target=1
def show_dicom_images(data):
img_data = list(data.T.to_dict().values())
fig, ax = plt.subplots(3,3,figsize=(16,18))
for i, data_row in enumerate(img_data):
img_file = data_row['patientId'] + '.dcm'
img = pydicom.dcmread(train_images_dir + img_file)
modality = img.Modality
age = img.PatientAge
sex = img.PatientSex
ax[i//3, i%3].imshow(img.pixel_array, cmap=plt.cm.bone)
ax[i//3, i%3].axis('off')
ax[i//3, i%3].set_title('ID: {}\nModality: {} Age: {} Sex: {} Target: {}\nClass: {}\nWindow: {}:{}:{}:{}'.format(
data_row['patientId'], modality, age, sex, data_row['Target'], data_row['class'],
data_row['x'],data_row['y'],data_row['width'],data_row['height']))
plt.show()
show_dicom_images(train_df[train_df['Target']==1].head(9))
def show_dicom_images_with_boxes(data):
img_data = list(data.T.to_dict().values())
f, ax = plt.subplots(3, 3, figsize=(16,18))
for i, data_row in enumerate(img_data):
img_file = data_row['patientId'] + '.dcm'
img = pydicom.dcmread(train_images_dir + img_file)
modality = img.Modality
age = img.PatientAge
sex = img.PatientSex
ax[i//3, i%3].imshow(img.pixel_array, cmap=plt.cm.bone)
ax[i//3, i%3].axis('off')
ax[i//3, i%3].set_title('ID: {}\nModality: {} Age: {} Sex: {} Target: {}\nClass: {}\nWindow: {}:{}:{}:{}'.format(
data_row['patientId'], modality, age, sex, data_row['Target'], data_row['class'],
data_row['x'],data_row['y'],data_row['width'],data_row['height']))
rows = train_df[train_df['patientId']==data_row['patientId']]
box_data = list(rows.T.to_dict().values())
for j, row in enumerate(box_data):
ax[i//3, i%3].add_patch(Rectangle(xy=(row['x'], row['y']),
width=row['width'],height=row['height'],
color="blue",alpha = 0.1))
plt.show()
show_dicom_images_with_boxes(train_df[train_df['Target']==1].head(9))
Plot DICOM Images with Target=0
# Since CXR images of patients with no Pneumonia do not have opacities, we do not display bounding boxes
show_dicom_images(train_df[train_df['Target']==0].head(9))
# Define the model parameters
IMAGE_SIZE = 256
BATCH_SIZE = 32 # Depends on your GPU or CPU RAM.
# Store all bbox coordinates for pneumonia patches in a dictionary
pneumonia_locations = {}
for _, row in train_labels_df.iterrows():
    filename = row['patientId']
    location = [row['x'], row['y'], row['width'], row['height']]
    pneumonia = row['Target']
    # If pneumonia is present, add the patientId and coordinates to the dict
    if pneumonia == 1:
        location = [int(float(v)) for v in location]
        if filename in pneumonia_locations:
            pneumonia_locations[filename].append(location)  # multiple bboxes for one patient are stored as sublists
        else:
            pneumonia_locations[filename] = [location]
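As a quick check of the dictionary layout, we can inspect one entry; a patient with several opacities maps to several [x, y, width, height] sublists (a minimal sketch):
# Peek at one patientId -> list of [x, y, width, height] boxes
sample_id, sample_boxes = next(iter(pneumonia_locations.items()))
print(sample_id, sample_boxes)
print('Patients with more than one box:', sum(len(v) > 1 for v in pneumonia_locations.values()))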
# Load and Shuffle filenames
import random
folder = PATH+'stage_2_train_images/'
filenames = os.listdir(folder)
random.shuffle(filenames)
# Split into train and validation filenames
n_valid_samples = 2500
train_filenames = filenames[n_valid_samples:]
valid_filenames = filenames[:n_valid_samples]
print('n train samples:', len(train_filenames))
print('n valid samples:', len(valid_filenames))
n_train_samples = len(filenames) - n_valid_samples
print('Total Train Images =', len(filenames))
print('Images with Pneumonia =', len(pneumonia_locations))
# Data Generator for CLASSIFICATION using Sequence for multiprocessing
import math
from skimage.transform import resize
from tensorflow.keras.utils import Sequence
class ClassificationDataGenerator(Sequence):
def __init__(self, folder, filenames, pneumonia_locations=None, batch_size=BATCH_SIZE, image_size=IMAGE_SIZE, shuffle=True, augment=False, predict=False):
self.folder = folder
self.filenames = filenames
self.pneumonia_locations = pneumonia_locations
self.batch_size = batch_size
self.image_size = image_size
self.shuffle = shuffle
self.augment = augment
self.predict = predict
self.on_epoch_end()
def __load__(self, filename):
# Load DICOM file as numpy array
img = pydicom.dcmread(self.folder + filename).pixel_array
        filename = filename.split('.')[0]  # Remove the file extension
        if filename in self.pneumonia_locations:  # use the instance's dict, not the global
            tgt = 1  # Has pneumonia
        else:
            tgt = 0  # Does not have pneumonia
        # If the augment flag is on, flip the image horizontally half the time
if self.augment and random.random() > 0.5:
img = np.fliplr(img)
# Resize the image
img = resize(img, (self.image_size, self.image_size), mode='symmetric')
# Add the channel dimension to the image files
img = np.expand_dims(img, -1)
return img, tgt
def __loadpredict__(self, filename):
# Load DICOM file as numpy array
img = pydicom.dcmread(self.folder + filename).pixel_array
img = resize(img, (self.image_size, self.image_size), mode='symmetric')
img = np.expand_dims(img, -1)
return img
def __getitem__(self, idx):
# select batch
filenames = self.filenames[idx*self.batch_size : (idx + 1)*self.batch_size] # Image path
if self.predict:
# Load files for this batch
imgs = [self.__loadpredict__(filename) for filename in filenames]
imgs = np.array(imgs) # Create numpy arrays for batch images
return imgs, filenames
else:
# Load files for this batch
items = [self.__load__(filename) for filename in filenames]
            imgs, tgts = zip(*items)  # __load__ returns (img, target) tuples, so unzip into images and targets
imgs = np.array(imgs) # Create numpy arrays for batch images
tgts = np.array(tgts) # Create numpy arrays for targets
return imgs, tgts
def on_epoch_end(self):
if self.shuffle:
random.shuffle(self.filenames)
def __len__(self):
if self.predict:
return int(np.ceil(len(self.filenames) / self.batch_size))
else:
return int(len(self.filenames) / self.batch_size)
# Data Generator for REGRESSION using Sequence for multiprocessing
import math
from skimage.transform import resize
from tensorflow.keras.utils import Sequence
class DataSequence(Sequence):
def __init__(self, folder, filenames, pneumonia_locations=None, batch_size=BATCH_SIZE, image_size=IMAGE_SIZE, shuffle=True, augment=False, predict=False):
self.folder = folder
self.filenames = filenames
self.pneumonia_locations = pneumonia_locations
self.batch_size = batch_size
self.image_size = image_size
self.shuffle = shuffle
self.augment = augment
self.predict = predict
self.on_epoch_end()
def __load__(self, filename):
# Load DICOM file as numpy array
img = pydicom.dcmread(self.folder + filename).pixel_array
# Create empty mask
mask = np.zeros(img.shape)
        filename = filename.split('.')[0]  # Remove the file extension
        if filename in self.pneumonia_locations:  # use the instance's dict, not the global
            for location in self.pneumonia_locations[filename]:
                x, y, width, height = location
                mask[y:y+height, x:x+width] = 1  # rows are y, columns are x: fill all pixels inside the bbox with 1
        # If the augment flag is on, flip the image and mask horizontally half the time
if self.augment and random.random() > 0.5:
img = np.fliplr(img)
mask = np.fliplr(mask)
# Resize both the image and mask
img = resize(img, (self.image_size, self.image_size), mode='symmetric')
mask = resize(mask, (self.image_size, self.image_size), mode='symmetric') > 0.5
# Add the channel dimension to both the image and masks files
img = np.expand_dims(img, -1)
mask = np.expand_dims(mask, -1)
return img, mask
def __loadpredict__(self, filename):
# Load DICOM file as numpy array
img = pydicom.dcmread(self.folder + filename).pixel_array
img = resize(img, (self.image_size, self.image_size), mode='symmetric')
img = np.expand_dims(img, -1)
return img
def __getitem__(self, idx):
# select batch
filenames = self.filenames[idx*self.batch_size : (idx + 1)*self.batch_size] # Image path
if self.predict:
# Load files for this batch
imgs = [self.__loadpredict__(filename) for filename in filenames]
imgs = np.array(imgs) # Create numpy arrays for batch images
return imgs, filenames
else:
# Load files for this batch
items = [self.__load__(filename) for filename in filenames]
            imgs, msks = zip(*items)  # __load__ returns (img, mask) tuples, so unzip into images and masks
imgs = np.array(imgs) # Create numpy arrays for batch images
msks = np.array(msks) # Create numpy arrays for batch masks
return imgs, msks
def on_epoch_end(self):
if self.shuffle:
random.shuffle(self.filenames)
def __len__(self):
if self.predict: # return everything
return int(np.ceil(len(self.filenames) / self.batch_size))
else: # return batches of size = BATCH_SIZE
return int(len(self.filenames) / self.batch_size)
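Before training, a one-batch shape check confirms the generator's output contract (a hedged sketch; it assumes folder, train_filenames and pneumonia_locations are defined as in the cells above):
# One-batch sanity check: both arrays should be (BATCH_SIZE, IMAGE_SIZE, IMAGE_SIZE, 1)
check_gen = DataSequence(folder, train_filenames, pneumonia_locations, shuffle=False)
imgs, msks = check_gen[0]
print('Images:', imgs.shape, 'Masks:', msks.shape)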
# Define IOU or Jaccard Loss function
def iou_loss(y_true, y_pred):
y_true = tf.reshape(y_true, [-1])
y_pred = tf.reshape(y_pred, [-1])
intersection = tf.reduce_sum(y_true * y_pred)
score = (intersection + 1.) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - intersection + 1.)
return 1 - score
# Combine BCE Loss and IOU Loss
def iou_bce_loss(y_true, y_pred):
return 0.5 * keras.losses.binary_crossentropy(y_true, y_pred) + 0.5 * iou_loss(y_true, y_pred)
# Mean IOU as a metric
def mean_iou(y_true, y_pred):
y_pred = tf.round(y_pred)
intersect = tf.reduce_sum(y_true * y_pred, axis=[1, 2, 3])
union = tf.reduce_sum(y_true, axis=[1, 2, 3]) + tf.reduce_sum(y_pred, axis=[1, 2, 3])
smooth = tf.ones(tf.shape(intersect))
return tf.reduce_mean((intersect + smooth) / (union - intersect + smooth))
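A toy example makes the loss behaviour concrete: identical masks should give a loss near 0 and disjoint masks a loss near 1 (a minimal sketch; the import is repeated here because the main TensorFlow imports happen in the next cell, and the printed values assume TF 2.x eager execution):
# Toy check of iou_loss on 2x2 masks
import tensorflow as tf
a = tf.constant([[1., 1.], [0., 0.]])
b = tf.constant([[0., 0.], [1., 1.]])
print(iou_loss(a, a))  # 0.0: perfect overlap
print(iou_loss(a, b))  # 0.8: no overlap (the +1 smoothing keeps it below 1)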
# Import necessary packages
import tensorflow as tf
from tensorflow import keras
from skimage import measure
from keras.models import Sequential, Model
from keras.layers import Conv2D, MaxPool2D, SeparableConv2D, UpSampling2D
from keras.layers import Dense, Flatten, Dropout, BatchNormalization, Input, MaxPooling2D
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
# CLASSIFICATION MODEL
inputs = Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 1))
# First conv block
x = Conv2D(filters=16, kernel_size=(3, 3), activation='relu', padding='same')(inputs)
x = Conv2D(filters=16, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = MaxPool2D(pool_size=(2, 2))(x)
# Second conv block
x = SeparableConv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = SeparableConv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 2))(x)
# Third conv block
x = SeparableConv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = SeparableConv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 2))(x)
# Fourth conv block
x = SeparableConv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = SeparableConv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 2))(x)
x = Dropout(rate=0.2)(x)
# Fifth conv block
x = SeparableConv2D(filters=256, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = SeparableConv2D(filters=256, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 2))(x)
x = Dropout(rate=0.2)(x)
# FC layer
x = Flatten()(x)
x = Dense(units=512, activation='relu')(x)
x = Dropout(rate=0.7)(x)
x = Dense(units=128, activation='relu')(x)
x = Dropout(rate=0.5)(x)
x = Dense(units=64, activation='relu')(x)
x = Dropout(rate=0.3)(x)
# Output layer
output = Dense(units=1, activation='sigmoid')(x)
# Create the model (compiled in the next cell)
classifier = Model(inputs=inputs, outputs=output)
classifier.summary()
# Compile classification model and track class level metrics apart from accuracy
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.AUC(), tf.keras.metrics.TruePositives(), tf.keras.metrics.TrueNegatives(), tf.keras.metrics.FalsePositives(), tf.keras.metrics.FalseNegatives()])
# Define Callbacks
checkpoint = ModelCheckpoint(filepath='best_weights_class.hdf5', save_best_only=True, save_weights_only=True)
lr_reduce = ReduceLROnPlateau(monitor='val_loss', factor=0.3, patience=1, verbose=2, mode='min')  # val_loss improves downward, so mode='min'
# Generate training and validation datasets for Classification Model
folder = PATH +'stage_2_train_images/'
class_train_gen = ClassificationDataGenerator(folder, train_filenames, pneumonia_locations, batch_size=BATCH_SIZE, image_size=IMAGE_SIZE, shuffle=False, predict=False, augment=True)
class_valid_gen = ClassificationDataGenerator(folder, valid_filenames, pneumonia_locations, batch_size=BATCH_SIZE, image_size=IMAGE_SIZE, shuffle=False, predict=False)
# Train the classification model
class_history = classifier.fit_generator(class_train_gen, validation_data=class_valid_gen, epochs=5, callbacks=[checkpoint, lr_reduce], verbose=1, shuffle=True)
# Accuracy and class level metrics captured per epoch
from keras.models import model_from_yaml
# Serialize classification model to YAML
class_yaml = classifier.to_yaml()
with open(project_path+"class_model3.yaml", "w") as yaml_file:
yaml_file.write(class_yaml)
# Serialize weights to HDF5
classifier.save_weights(project_path+"class_model3_weights.h5")
print("Saved classifier model to disk")
# Gather training Loss, Accuracy and other class level metrics
df_classi_metrics = pd.DataFrame(columns=['Model','Epoch','Train loss','Valid loss','Train accuracy','Valid accuracy','Train Precision','Valid Precision','Train Recall','Valid Recall','Train AUC','Valid AUC','Train True Positives','Train True Negatives','Train False Positives','Train False Negatives','Valid True Positives','Valid True Negatives','Valid False Positives','Valid False Negatives'])
for i, epoch_num in enumerate(class_history.epoch):
df_classi_metrics = df_classi_metrics.append({
'Model':'Classification Model',
'Epoch': int(epoch_num),
'Train loss':class_history.history["loss"][i],
'Valid loss':class_history.history["val_loss"][i],
'Train accuracy':class_history.history["accuracy"][i],
'Valid accuracy':class_history.history["val_accuracy"][i],
'Train Precision':class_history.history['precision'][i],
'Valid Precision':class_history.history["val_precision"][i],
'Train Recall':class_history.history['recall'][i],
'Valid Recall':class_history.history["val_recall"][i],
'Train AUC':class_history.history['auc'][i],
'Valid AUC':class_history.history["val_auc"][i],
'Train True Positives':class_history.history['true_positives'][i],
'Train True Negatives':class_history.history['true_negatives'][i],
'Train False Positives':class_history.history['false_positives'][i],
'Train False Negatives':class_history.history['false_negatives'][i],
'Valid True Positives':class_history.history['val_true_positives'][i],
'Valid True Negatives':class_history.history['val_true_negatives'][i],
'Valid False Positives':class_history.history['val_false_positives'][i],
'Valid False Negatives':class_history.history['val_false_negatives'][i]},
ignore_index=True)
df_classi_metrics.to_csv(project_path + 'classification_metrics.csv')
df_classi_metrics
history = class_history
plt.figure(figsize=(20,10))
plt.subplot(341)
plt.plot(history.epoch, history.history["loss"], label="Train loss")
plt.plot(history.epoch, history.history["val_loss"], label="Valid loss")
plt.legend()
plt.subplot(342)
plt.plot(history.epoch, history.history["accuracy"], label="Train accuracy")
plt.plot(history.epoch, history.history["val_accuracy"], label="Valid accuracy")
plt.legend()
plt.subplot(343)
plt.plot(history.epoch, history.history['auc'], label="Train AUC")
plt.plot(history.epoch, history.history["val_auc"], label="Valid AUC")
plt.legend()
plt.subplot(345)
plt.plot(history.epoch, history.history['precision'], label="Train Precision")
plt.plot(history.epoch, history.history["val_precision"], label="Valid Precision")
plt.legend()
plt.subplot(346)
plt.plot(history.epoch, history.history['recall'], label="Train Recall")
plt.plot(history.epoch, history.history["val_recall"], label="Valid Recall")
plt.legend()
plt.subplot(349)
plt.plot(history.epoch, history.history["true_positives"], label="Train TP")
plt.plot(history.epoch, history.history["val_true_positives"], label="Valid TP")
plt.legend()
plt.subplot(3,4,10)
plt.plot(history.epoch, history.history["true_negatives"], label="Train TN")
plt.plot(history.epoch, history.history["val_true_negatives"], label="Valid TN")
plt.legend()
plt.subplot(3,4,11)
plt.plot(history.epoch, history.history["false_positives"], label="Train TN")
plt.plot(history.epoch, history.history["val_false_positives"], label="Valid TN")
plt.legend()
plt.subplot(3,4,12)
plt.plot(history.epoch, history.history["false_negatives"], label="Train TN")
plt.plot(history.epoch, history.history["val_false_negatives"], label="Valid TN")
plt.legend()
plt.show()
# Load classification YAML and create model
with open(project_path + 'class_model3.yaml', 'r') as yaml_file:
loaded_model_yaml = yaml_file.read()
loaded_class_model3 = model_from_yaml(loaded_model_yaml)
# Load weights into new model
loaded_class_model3.load_weights(project_path + "class_model3_weights.h5")
print("Loaded classification model from disk")
# loaded_class_model3.summary()
# Generate the test data for the entire test set
folder = PATH + 'stage_2_test_images/'
test_filenames = os.listdir(folder)
print('n test samples', len(test_filenames))
# Set predict=True so the generator yields images together with their filenames and covers the full test set
class_test_gen = ClassificationDataGenerator(folder, test_filenames, pneumonia_locations, batch_size=BATCH_SIZE, image_size=IMAGE_SIZE, shuffle=False, predict=True)
# Create submission dictionary
class_preds_dict = {}
# Loop through testset
for imgs, filenames in class_test_gen:
# predict batch of images
preds = classifier.predict(imgs)
# loop through batch
for pred, filename in zip(preds, filenames):
# add filename and predictionString to dictionary
filename = filename.split('.')[0]
class_preds_dict[filename] = pred
# stop if we've got them all
if len(class_preds_dict) >= len(test_filenames):
break
# Save dictionary as csv file
sub = pd.DataFrame.from_dict(class_preds_dict, orient='index')
sub.index.names = ['patientId']
sub.columns = ['hasPneumonia']
sub.to_csv(project_path + 'PneumoniaClassPrediction.csv')
# Creates a class prediction file
sub.head()
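The saved probabilities can be turned into hard labels with a simple threshold; the 0.5 cutoff is an assumption here and could instead be tuned on the validation set (a minimal sketch):
# Threshold predicted probabilities into binary labels (0.5 is an assumed cutoff)
sub['predictedLabel'] = (sub['hasPneumonia'] > 0.5).astype(int)
print(sub['predictedLabel'].value_counts())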
# Model 1 with Conv2D Layers
model1 = Sequential()
# Add convolution layers with ReLU activation; the first uses 4 kernels of 3x3
model1.add(Conv2D(4, (3, 3), input_shape = (IMAGE_SIZE, IMAGE_SIZE, 1), activation = 'relu', padding = 'same'))
model1.add(MaxPooling2D(pool_size = (2, 2))) # Max Pooling layer of size 2X2
model1.add(Conv2D(16, (2, 2), activation = 'relu', padding = 'same'))
model1.add(MaxPooling2D(pool_size = (2, 2)))
model1.add(BatchNormalization())
model1.add(Conv2D(64, (2, 2), activation = 'relu', padding = 'same'))
model1.add(BatchNormalization())
model1.add(MaxPooling2D(pool_size = (2, 2)))
model1.add(Dropout(0.4))
model1.add(Conv2D(128, (2, 2), activation = 'relu', padding = 'same'))
model1.add(MaxPooling2D(pool_size = (2, 2)))
model1.add(Dropout(0.3))
# A 1x1 conv with a single sigmoid filter gives a per-pixel probability map
model1.add(Conv2D(1, 1, activation='sigmoid'))
# UpSampling2D(16) undoes the four 2x2 poolings (2**4 = 16), restoring the 256x256 input size
model1.add(UpSampling2D(16))
model1.summary()
# Compile the model
model1.compile(optimizer='adam', loss=iou_bce_loss, metrics=['accuracy', mean_iou, tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.AUC()])
# Define Callbacks
checkpoint = ModelCheckpoint(filepath='best_weights.hdf5', save_best_only=True, save_weights_only=True)
lr_reduce = ReduceLROnPlateau(monitor='val_loss', factor=0.3, patience=5, verbose=2, mode='min')  # val_loss improves downward, so mode='min'
# Create train and validation generators
folder = PATH + 'stage_2_train_images/'  # trailing slash needed since the generator concatenates folder + filename
train_gen = DataSequence(folder, train_filenames, pneumonia_locations, batch_size=32, image_size=256, shuffle=False, predict=False, augment= True)
valid_gen = DataSequence(folder, valid_filenames, pneumonia_locations, batch_size=32, image_size=256, shuffle=False, predict=False)
# Fit the model
#history = model1.fit_generator(train_gen, validation_data=valid_gen, callbacks=[checkpoint, lr_reduce], epochs=5, shuffle=True, verbose=1)
# Save model1 in YAML format
from keras.models import model_from_yaml
# Serialize model to YAML
model_yaml = model1.to_yaml()
with open(project_path+"model_1.yaml", "w") as yaml_file:
yaml_file.write(model_yaml)
# Serialize weights to HDF5
model1.save_weights(project_path+"model_1.h5")
print("Saved regression model to disk")
# Get Training Loss, Accuracy from model history
df_regr_metrics = pd.DataFrame(columns=['Model','Epoch','Train loss','Valid loss','Train accuracy','Valid accuracy','Train iou','Valid iou','Train Precision','Valid Precision','Train Recall','Valid Recall','Train AUC','Valid AUC'])
for i, epoch_num in enumerate(history.epoch):
df_regr_metrics = df_regr_metrics.append({
'Model':'Model1',
'Epoch': int(epoch_num),
'Train loss':history.history["loss"][i],
'Valid loss':history.history["val_loss"][i],
'Train accuracy':history.history["accuracy"][i],
'Valid accuracy':history.history["val_accuracy"][i],
'Train iou':history.history["mean_iou"][i],
'Valid iou':history.history["val_mean_iou"][i],
'Train Precision':history.history['precision'][i],
'Valid Precision':history.history["val_precision"][i],
'Train Recall':history.history['recall'][i],
'Valid Recall':history.history["val_recall"][i],
'Train AUC':history.history['auc'][i],
'Valid AUC':history.history["val_auc"][i]}, ignore_index=True)
# saving the dataframe
df_regr_metrics.to_csv(project_path + 'model1_regr_metrics.csv')  # save to the project folder like the other metrics files
df_regr_metrics
# Load regression YAML and create model
from keras.models import model_from_yaml
with open(project_path + 'model_1.yaml', 'r') as yaml_file:
loaded_model_yaml = yaml_file.read()
loaded_regr_model1 = model_from_yaml(loaded_model_yaml)
# Load weights into new model
loaded_regr_model1.load_weights(project_path + "model_1.h5")
print("Loaded regression model from disk")
# loaded_regr_model1.summary()
# Model 2 with SeparableConv2D layers
model2 = Sequential()  # Create an instance of the Sequential model
model2.add(Conv2D(filters=16, kernel_size=(2,2), input_shape=(IMAGE_SIZE, IMAGE_SIZE, 1), activation='relu', padding='same'))
model2.add(MaxPooling2D(pool_size = (2, 2)))
model2.add(SeparableConv2D(filters=32, kernel_size=(2,2), activation='relu', padding='same'))
model2.add(MaxPooling2D(pool_size = (2, 2)))
model2.add(SeparableConv2D(filters=64, kernel_size=(2,2), activation='relu', padding='same'))
model2.add(MaxPooling2D(pool_size = (2, 2)))
model2.add(SeparableConv2D(filters=128, kernel_size=(2,2), activation='relu', padding='same'))
model2.add(MaxPooling2D(pool_size = (2, 2)))
model2.add(SeparableConv2D(filters=256, kernel_size=(2,2), activation='relu', padding='same'))
model2.add(MaxPooling2D(pool_size = (2, 2)))
model2.add(SeparableConv2D(filters=256, kernel_size=(2,2), activation='relu', padding='same'))
model2.add(BatchNormalization(momentum=0.9))
model2.add(MaxPooling2D(pool_size = (2, 2)))
model2.add(Conv2D(1, 1, activation='sigmoid'))
model2.add(UpSampling2D())
model2.add(UpSampling2D())
model2.add(UpSampling2D())
model2.add(UpSampling2D())
model2.add(UpSampling2D())
model2.add(UpSampling2D())
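# Note: each UpSampling2D() call doubles the spatial size, so the six layers above
# scale the 4x4 feature map left after six 2x2 poolings back up by 2**6 = 64 to 256x256.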
model2.summary()
# Compile model using Adam optimizer, combination of BCE and IOU loss, and monitor model performance using accuracy and average IOU
model2.compile(optimizer='adam', loss=iou_bce_loss, metrics=['accuracy', mean_iou, tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.AUC()])
# Define Callbacks
checkpoint = ModelCheckpoint(filepath='best_weights.hdf5', save_best_only=True, save_weights_only=True)
lr_reduce = ReduceLROnPlateau(monitor='val_loss', factor=0.3, patience=1, verbose=2, mode='min')  # val_loss improves downward, so mode='min'
# Generate training and validation datasets for Regression Model
folder = PATH +'stage_2_train_images/'
train_gen = DataSequence(folder, train_filenames, pneumonia_locations, batch_size=BATCH_SIZE, image_size=IMAGE_SIZE, shuffle=False, predict=False, augment=True)
valid_gen = DataSequence(folder, valid_filenames, pneumonia_locations, batch_size=BATCH_SIZE, image_size=IMAGE_SIZE, shuffle=False, predict=False)
# Fit Model 2
#history2 = model2.fit_generator(train_gen, validation_data=valid_gen, epochs=5, callbacks=[checkpoint, lr_reduce], verbose=1, shuffle=True)
model2.save(project_path + 'model_2_5.h5')  # save the full model to the same path the weights are loaded from below
# Save model in yaml format
from keras.models import model_from_yaml
# Serialize model to YAML
model_yaml = model2.to_yaml()
with open(project_path+"model_2_5.yaml", "w") as yaml_file:
yaml_file.write(model_yaml)
# Serialize weights to HDF5
model2.save_weights(project_path+"model2_weights.h5")
print("Saved regression model to disk")
# Get Training Loss, Accuracy from model history
df_regr_metrics = pd.DataFrame(columns=['Model','Epoch','Train loss','Valid loss','Train accuracy','Valid accuracy','Train iou','Valid iou','Train Precision','Valid Precision','Train Recall','Valid Recall','Train AUC','Valid AUC'])
history = history2
for i, epoch_num in enumerate(history.epoch):
df_regr_metrics = df_regr_metrics.append({
'Model':'Model2',
'Epoch': int(epoch_num),
'Train loss':history.history["loss"][i],
'Valid loss':history.history["val_loss"][i],
'Train accuracy':history.history["accuracy"][i],
'Valid accuracy':history.history["val_accuracy"][i],
'Train iou':history.history["mean_iou"][i],
'Valid iou':history.history["val_mean_iou"][i],
'Train Precision':history.history['precision'][i],
'Valid Precision':history.history["val_precision"][i],
'Train Recall':history.history['recall'][i],
'Valid Recall':history.history["val_recall"][i],
'Train AUC':history.history['auc'][i],
'Valid AUC':history.history["val_auc"][i]}, ignore_index=True)
df_regr_metrics.to_csv(project_path+ 'model2_regr_metrics.csv')
df_regr_metrics
# Load regression YAML and create model
with open(project_path + 'model_2_5.yaml', 'r') as yaml_file:
loaded_model_yaml = yaml_file.read()
loaded_regr_model2 = model_from_yaml(loaded_model_yaml)
# Load weights into new model
loaded_regr_model2.load_weights(project_path + "model_2_5.h5")
print("Loaded regression model from disk")
# loaded_regr_model2.summary()
def create_downsample(channels, inputs):
x = keras.layers.BatchNormalization(momentum=0.9)(inputs)
x = keras.layers.LeakyReLU(0)(x)
x = keras.layers.Conv2D(channels, 1, padding='same', use_bias=False)(x)
x = keras.layers.MaxPool2D(2)(x)
    # Added start
    # x = keras.layers.Conv2D(channels, 1, padding='same', use_bias=False)(x)
    # x = keras.layers.MaxPool2D(2)(x)
    # Added end
return x
def create_resblock(channels, inputs):
x = keras.layers.BatchNormalization(momentum=0.9)(inputs)
x = keras.layers.LeakyReLU(0)(x)
x = keras.layers.Conv2D(channels, 3, padding='same', use_bias=False)(x)
x = keras.layers.BatchNormalization(momentum=0.9)(x)
x = keras.layers.LeakyReLU(0)(x)
x = keras.layers.Conv2D(channels, 3, padding='same', use_bias=False)(x)
#Added Start
x = keras.layers.BatchNormalization(momentum=0.9)(x)
x = keras.layers.LeakyReLU(0)(x)
x = keras.layers.Conv2D(channels, 3, padding='same', use_bias=False)(x)
#Added End
    addInput = x
print("Add input shape:", addInput.shape)
print("Resnet block input shape:", inputs.shape)
resBlockOut = keras.layers.add([addInput, inputs])
print("Resnet block out shape:", resBlockOut.shape)
out = keras.layers.concatenate([resBlockOut, addInput], axis=3)
print("concat block out shape:", out.shape)
out = keras.layers.Conv2D(channels, 1, padding='same', use_bias=False)(out)
print("mixed block out shape:", out.shape)
return out
def create_network(input_size, channels, n_blocks=2, depth=4):
# input
inputs = keras.Input(shape=(input_size, input_size, 1))
x = keras.layers.Conv2D(channels, 3, padding='same', use_bias=False)(inputs)
    # residual blocks
for d in range(depth):
channels = channels * 2
x = create_downsample(channels, x)
for b in range(n_blocks):
x = create_resblock(channels, x)
# output
x = keras.layers.BatchNormalization(momentum=0.9)(x)
x = keras.layers.LeakyReLU(0)(x)
x = keras.layers.Conv2D(1, 1, activation='sigmoid')(x)
outputs = keras.layers.UpSampling2D(2**depth)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
return model
# Create the network (referred to as model4 in the cells below) and compile it
model4 = create_network(input_size=128, channels=16, n_blocks=2, depth=3)
model4.compile(optimizer='adam',
               loss=iou_bce_loss,
               metrics=['accuracy', mean_iou, tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.AUC()])
model4.summary()  # summary() prints directly; wrapping it in print() would just show 'None'
# cosine learning rate annealing
def cosine_annealing(x):
lr = 0.001
epochs = 25
return lr*(np.cos(np.pi*x/epochs)+1.)/2
learning_rate = tf.keras.callbacks.LearningRateScheduler(cosine_annealing)
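Plotting the schedule is a quick way to verify the annealing shape (a minimal sketch; the 25-epoch horizon matches the constant inside cosine_annealing):
# Visualize the cosine-annealed learning rate over the 25-epoch horizon
epochs_axis = np.arange(25)
plt.plot(epochs_axis, [cosine_annealing(e) for e in epochs_axis])
plt.xlabel('Epoch')
plt.ylabel('Learning rate')
plt.title('Cosine annealing schedule')
plt.show()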
# Create train and validation generators
folder = PATH + 'stage_2_train_images/'  # trailing slash needed since the generator concatenates folder + filename
train_gen = DataSequence(folder, train_filenames, pneumonia_locations, batch_size=16, image_size=128, shuffle=False, augment= True)
valid_gen = DataSequence(folder, valid_filenames, pneumonia_locations, batch_size=16, image_size=128, shuffle=False, predict=False)
# Fit the model
# The fit call is commented out so it is not run accidentally
#history4 = model4.fit_generator(train_gen, validation_data=valid_gen, callbacks=[learning_rate], epochs=5, shuffle=True, verbose=1)  # callbacks must be a list
# Save model4 in yaml format
from keras.models import model_from_yaml
# Serialize model to YAML
model_yaml = model4.to_yaml()
with open(project_path+"model4.yaml", "w") as yaml_file:
yaml_file.write(model_yaml)
# Serialize weights to HDF5
model4.save_weights(project_path+"model4.h5")
print("Saved regression model to disk")
# Get Training Loss, Accuracy from model history
df_regr_metrics = pd.DataFrame(columns=['Model','Epoch','Train loss','Valid loss','Train accuracy','Valid accuracy','Train iou','Valid iou','Train Precision','Valid Precision','Train Recall','Valid Recall','Train AUC','Valid AUC'])
history = history4
for i, epoch_num in enumerate(history.epoch):
df_regr_metrics = df_regr_metrics.append({
'Model':'Model4',
'Epoch': int(epoch_num),
'Train loss':history.history["loss"][i],
'Valid loss':history.history["val_loss"][i],
'Train accuracy':history.history["accuracy"][i],
'Valid accuracy':history.history["val_accuracy"][i],
'Train iou':history.history["mean_iou"][i],
'Valid iou':history.history["val_mean_iou"][i],
'Train Precision':history.history["precision"][i],
'Valid Precision':history.history["val_precision"][i],
'Train Recall':history.history["recall"][i],
'Valid Recall':history.history["val_recall"][i],
'Train AUC':history.history["auc"][i],
'Valid AUC':history.history["val_auc"][i]}, ignore_index=True)
df_regr_metrics
# Save the metrics dataframe to CSV
df_regr_metrics.to_csv(project_path + 'Model4_metrics.csv')
# MODEL TO PREDICT BBOX COORDINATES
inputs = Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 1))
# First conv block
x = Conv2D(filters=16, kernel_size=(3, 3), activation='relu', padding='same')(inputs)
x = Conv2D(filters=16, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = MaxPool2D(pool_size=(2, 2))(x)
# Second conv block
x = SeparableConv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = SeparableConv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 2))(x)
# Third conv block
x = SeparableConv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = SeparableConv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 2))(x)
# Fourth conv block
x = SeparableConv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = SeparableConv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 2))(x)
x = Dropout(rate=0.2)(x)
# Fifth conv block
x = SeparableConv2D(filters=256, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = SeparableConv2D(filters=256, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 2))(x)
x = Dropout(rate=0.2)(x)
# A single-filter conv with sigmoid activation produces a per-pixel probability map
x = Conv2D(filters=1, kernel_size=(2,2), activation='sigmoid', padding='same')(x)
output = UpSampling2D(32)(x)  # 2**5 = 32 undoes the five 2x2 poolings, restoring 256x256
# Creating model
model3 = Model(inputs=inputs, outputs=output)
model3.summary()
# Use Adam optimizer, combination of BCE and IOU loss, and monitor model performance using accuracy and average IOU
model3.compile(optimizer='adam', loss=iou_bce_loss, metrics=['accuracy', mean_iou, tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.AUC()])
# Define Callbacks
checkpoint = ModelCheckpoint(filepath='best_weights.hdf5', save_best_only=True, save_weights_only=True)
lr_reduce = ReduceLROnPlateau(monitor='val_loss', factor=0.3, patience=1, verbose=2, mode='min')  # val_loss improves downward, so mode='min'
# Generate training and validation datasets for Regression Model
folder = PATH +'stage_2_train_images/'
train_gen = DataSequence(folder, train_filenames, pneumonia_locations, batch_size=BATCH_SIZE, image_size=IMAGE_SIZE, shuffle=False, predict=False, augment=True)
valid_gen = DataSequence(folder, valid_filenames, pneumonia_locations, batch_size=BATCH_SIZE, image_size=IMAGE_SIZE, shuffle=False, predict=False)
# Fit Model3
history3 = model3.fit_generator(train_gen, validation_data=valid_gen, epochs=5, callbacks=[checkpoint, lr_reduce], verbose=1, shuffle=True)
# Save model in yaml format
from keras.models import model_from_yaml
# Serialize model to YAML
model_yaml = model3.to_yaml()
with open(project_path+"model3.yaml", "w") as yaml_file:
yaml_file.write(model_yaml)
# Serialize weights to HDF5
model3.save_weights(project_path+"model3_weights.h5")
print("Saved regression model to disk")
history = history3
plt.figure(figsize=(20,6))
plt.subplot(141)
plt.plot(history.epoch, history.history["loss"], label="Train loss")
plt.plot(history.epoch, history.history["val_loss"], label="Valid loss")
plt.legend()
plt.subplot(142)
plt.plot(history.epoch, history.history["accuracy"], label="Train accuracy")
plt.plot(history.epoch, history.history["val_accuracy"], label="Valid accuracy")
plt.legend()
plt.subplot(143)
plt.plot(history.epoch, history.history["mean_iou"], label="Train iou")
plt.plot(history.epoch, history.history["val_mean_iou"], label="Valid iou")
plt.legend()
plt.subplot(144)
plt.plot(history.epoch, history.history['auc_1'], label="Train AUC")
plt.plot(history.epoch, history.history["val_auc_1"], label="Valid AUC")
plt.legend()
plt.show()
# Get Training Loss, Accuracy from model history
df_regr_metrics = pd.DataFrame(columns=['Model','Epoch','Train loss','Valid loss','Train accuracy','Valid accuracy','Train iou','Valid iou','Train Precision','Valid Precision','Train Recall','Valid Recall','Train AUC','Valid AUC'])
for i, epoch_num in enumerate(history.epoch):
df_regr_metrics = df_regr_metrics.append({
'Model':'Model3',
'Epoch': int(epoch_num),
'Train loss':history.history["loss"][i],
'Valid loss':history.history["val_loss"][i],
'Train accuracy':history.history["accuracy"][i],
'Valid accuracy':history.history["val_accuracy"][i],
'Train iou':history.history["mean_iou"][i],
'Valid iou':history.history["val_mean_iou"][i],
'Train Precision':history.history['precision_1'][i],
'Valid Precision':history.history["val_precision_1"][i],
'Train Recall':history.history['recall_1'][i],
'Valid Recall':history.history["val_recall_1"][i],
'Train AUC':history.history['auc_1'][i],
'Valid AUC':history.history["val_auc_1"][i]}, ignore_index=True)
# saving the dataframe
df_regr_metrics.to_csv(project_path + 'model3_regr_metrics.csv')
df_regr_metrics
# Load regression YAML and create model
with open(project_path + 'model3.yaml', 'r') as yaml_file:
loaded_model_yaml = yaml_file.read()
loaded_regr_model3 = model_from_yaml(loaded_model_yaml)
# Load weights into new model
loaded_regr_model3.load_weights(project_path + "model3_weights.h5")
print("Loaded regression model from disk")
# loaded_regr_model3.summary()
# Plot metrics for all the models
df_combined_model_metrics = pd.read_csv(project_path + 'combined_regr_metrics.csv')
sns.set_style("whitegrid")
plt.figure(figsize=(20,6))
plt.subplot(131)
sns.lineplot(x='Epoch', y='Valid accuracy', hue='Model', data=df_combined_model_metrics)
plt.subplot(132)
sns.lineplot(x='Epoch', y='Valid loss', hue='Model', data=df_combined_model_metrics)
plt.subplot(133)
sns.lineplot(x='Epoch', y='Valid AUC', hue='Model', data=df_combined_model_metrics)
# Display the actual vs predicted bounding boxes
from skimage import measure
import matplotlib.patches as patches
model = model3  # or loaded_regr_model3 to use the weights reloaded from disk
for imgs, msks in valid_gen:
# predict batch of images
preds = model.predict(imgs)
# create figure
f, axarr = plt.subplots(4, 8, figsize=(20,20))
axarr = axarr.ravel()
axidx = 0
# Loop through batch
for img, msk, pred in zip(imgs, msks, preds):
# plot image
axarr[axidx].imshow(img[:, :, 0], cmap=plt.cm.bone)
        # threshold the true mask
        comp = msk[:, :, 0] > 0.5
        # apply connected components
        comp = measure.label(comp)
        # draw a bounding box for each true region (blue)
for region in measure.regionprops(comp):
# retrieve x, y, height and width
y, x, y2, x2 = region.bbox
height = y2 - y
width = x2 - x
axarr[axidx].add_patch(patches.Rectangle((x,y),width,height,linewidth=2,edgecolor='b',facecolor='none'))
# threshold predicted mask
comp = pred[:, :, 0] > 0.5
# apply connected components
comp = measure.label(comp)
        # draw a bounding box for each predicted region (red)
for region in measure.regionprops(comp):
# retrieve x, y, height and width
y, x, y2, x2 = region.bbox
height = y2 - y
width = x2 - x
axarr[axidx].add_patch(patches.Rectangle((x,y),width,height,linewidth=2,edgecolor='r',facecolor ='none'))
axidx += 1
plt.show()
    # only plot one batch
break
Submission File Creation
# Generate the test data for the entire test set
folder = PATH + 'stage_2_test_images/'
test_filenames = os.listdir(folder)
print('n test samples', len(test_filenames))
# Set predict=True so the generator yields images together with their filenames and covers every test image
test_gen = DataSequence(folder, test_filenames, pneumonia_locations, batch_size=BATCH_SIZE, image_size=IMAGE_SIZE, shuffle=False, predict=True)
submission_dict = {}
# Predict the bbox coordinates for the test dataset and build the submission strings
for imgs, filenames in test_gen:
# predict batch of images
preds = model.predict(imgs)
for filename, pred in zip(filenames, preds):
# resize predicted mask
pred = resize(pred, (1024, 1024), mode='reflect')
# threshold predicted mask
comp = pred[:, :, 0] > 0.5
# apply connected components
comp = measure.label(comp)
# apply bounding boxes
predictionString = ''
for region in measure.regionprops(comp):
# retrieve x, y, height and width
y, x, y2, x2 = region.bbox
height = y2 - y
width = x2 - x
# proxy for confidence score
conf = np.mean(pred[y:y+height, x:x+width])
predictionString += str(conf) + ' ' + str(x) + ' ' + str(y) + ' ' + str(width) + ' ' + str(height) + ' '
filename = filename.split('.')[0]
submission_dict[filename] = predictionString
# stop if we've got them all
if len(submission_dict) >= len(test_filenames):
break
# save dictionary as csv file
sub = pd.DataFrame.from_dict(submission_dict, orient='index')
sub.index.names = ['patientId']
sub.columns = ['PredictionString']
sub.to_csv(project_path+'submission.csv')
Analysis of Predictions
# Read submission data into a dataframe for some further analysis
submission_df = pd.read_csv(project_path+'submission.csv')
submission_df.head()
# Check how many patients are diagnosed with Pneumonia
temp = pd.DataFrame(submission_df.groupby(submission_df['PredictionString'].notnull())['patientId'].count())
temp.columns = ['Count of Patients']
print(temp)
print('\nPatients that may have pneumonia: {:0.2f}%'.format(100 * 935 / 3000))
# Out of 3000 patients in the test set, 935 appear to have pneumonia for whom one or more bounding boxes are predicted by our model.
# Approximately 31% of patients are predicted to have pneumonia which is consistent with the training set data.
# Take a sample of images from the test set and plot the predicted bounding boxes on them
# Select a few sample patientIds from the submission to visualize
search_values = ['1f5f2bb2-acc2-4c18-b6d5-212f9a9980b5','0066ba32-08b6-4ac9-8d5a-abec69825d53','2b9990c6-dd10-44e4-9d5a-5701a964a57d']
data = submission_df[submission_df.patientId.str.contains('|'.join(search_values))]
data
# Selected 1 image with no predicted bounding box, 1 with a single bounding box and the last one with 2 predicted bounding boxes.
# Plot the images for the above samples with their predicted bounding boxes
img_data = list(data.T.to_dict().values())
f, ax = plt.subplots(1, 3, figsize=(15,5))
for i, data_row in enumerate(img_data):
# Get the patient info from the dcm image and plot the image
patid = data_row['patientId']
img_file = patid + '.dcm'
img = pydicom.dcmread(folder + img_file)
modality = img.Modality
age = img.PatientAge
sex = img.PatientSex
location = data_row['PredictionString']
if pd.notnull(location):
tgt = 1
else:
tgt = 0
ax[i].imshow(img.pixel_array, cmap=plt.cm.bone)
ax[i].axis('off')
ax[i].set_title('ID: {}\nModality: {} Age: {} Sex: {} Target: {}'.format(patid, modality, age, sex, tgt))
# If bounding boxes are predicted for the image, plot them on top of the image
if pd.notnull(location):
        loc_list = location.split()
        n_boxes = len(loc_list) // 5  # each box is 'conf x y width height'
        for j in range(n_boxes):
            base = 5 * j  # loc_list[base] is the confidence score for box j
            x = int(loc_list[base + 1])
            y = int(loc_list[base + 2])
            wid = int(loc_list[base + 3])
            hgt = int(loc_list[base + 4])
rect = patches.Rectangle((x,y),wid,hgt,linewidth=1,edgecolor='r',facecolor='none')
ax[i].add_patch(rect)
plt.show()
As the samples above show, the predicted boxes line up with the pneumonic patches.
Our model predicts bounding boxes around pneumonic patches with an accuracy of 87% on the test data. Of the 3000 patients in the test set, 935 appear to have pneumonia, and for these the model predicts one or more bounding boxes. Approximately 31% of patients are predicted to have pneumonia, which is consistent with the training data.
This project applies recent computer-vision and deep-learning techniques and visualizes the key outcomes. It can help health-care professionals make quick and accurate assessments of pneumonia in CXR images.